In the following exercises we will wrangle some data on population from Gapminder.

1

Read in the Gapminder population data and store it as a new object called gap_pop.
The data we need are stored in a .csv file which you can find in the folder ./data/gapminder.
library(readr)

gap_pop <- read_csv("../data/gapminder/population_total.csv")

As you may have noticed, the name of the first column in the dataset does not match its content.

2

Rename the variable Total population to country and store the result in an object with the same name (gap_pop).
library(dplyr)

gap_pop <- gap_pop %>% 
  rename(country = "Total population")

gap_pop
## # A tibble: 275 x 82
##    country  `1800`  `1810`  `1820`  `1830`  `1840`  `1850`  `1860`  `1870`
##    <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Abkhaz~      NA      NA      NA      NA      NA      NA      NA      NA
##  2 Afghan~ 3280000 3280000 3323519 3448982 3625022 3810047 3973968 4169690
##  3 Akroti~      NA      NA      NA      NA      NA      NA      NA      NA
##  4 Albania  410445  423591  438671  457234  478227  506889  552800  610036
##  5 Algeria 2503218 2595056 2713079 2880355 3082721 3299305 3536468 3811028
##  6 Americ~    8170    8156    8142    8128    8114    7958    7564    7057
##  7 Andorra    2654    2654    2700    2835    3026    3230    3436    3654
##  8 Angola  1567028 1567028 1597530 1686390 1813100 1949329 2110747 2285417
##  9 Anguil~    2025    2025    2064    2177    2338    2511    2693    2888
## 10 Antigu~   37000   37000   37000   37000   37000   37000   36532   35546
## # ... with 265 more rows, and 73 more variables: `1880` <dbl>,
## #   `1890` <dbl>, `1900` <dbl>, `1910` <dbl>, `1920` <dbl>, `1930` <dbl>,
## #   `1940` <dbl>, `1950` <dbl>, `1951` <dbl>, `1952` <dbl>, `1953` <dbl>,
## #   `1954` <dbl>, `1955` <dbl>, `1956` <dbl>, `1957` <dbl>, `1958` <dbl>,
## #   `1959` <dbl>, `1960` <dbl>, `1961` <dbl>, `1962` <dbl>, `1963` <dbl>,
## #   `1964` <dbl>, `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>,
## #   `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
## #   `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
## #   `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
## #   `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
## #   `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
## #   `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
## #   `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>,
## #   `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>,
## #   `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>,
## #   `2014` <dbl>, `2015` <dbl>

As you have probably noticed, the data are currently in wide format.

3

Using the data in wide format, select only data for the years 1990 to 1999.
As the values for each year are in separate columns, you need to use the select() function.
gap_pop %>% 
  select(country, "1990":"1999")
## # A tibble: 275 x 11
##    country  `1990`  `1991`  `1992`  `1993`  `1994`  `1995`  `1996`  `1997`
##    <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Abkhaz~ NA      NA      NA      NA      NA      NA      NA      NA     
##  2 Afghan~  1.21e7  1.28e7  1.37e7  1.48e7  1.59e7  1.68e7  1.75e7  1.80e7
##  3 Akroti~  1.41e4  1.42e4  1.43e4  1.44e4  1.45e4  1.46e4  1.47e4  1.48e4
##  4 Albania  3.28e6  3.28e6  3.24e6  3.19e6  3.14e6  3.11e6  3.09e6  3.09e6
##  5 Algeria  2.59e7  2.66e7  2.72e7  2.78e7  2.84e7  2.89e7  2.94e7  2.99e7
##  6 Americ~  4.70e4  4.84e4  4.96e4  5.07e4  5.18e4  5.29e4  5.39e4  5.49e4
##  7 Andorra  5.45e4  5.67e4  5.89e4  6.10e4  6.27e4  6.39e4  6.43e4  6.41e4
##  8 Angola   1.11e7  1.15e7  1.18e7  1.22e7  1.26e7  1.30e7  1.34e7  1.38e7
##  9 Anguil~  8.33e3  8.67e3  8.98e3  9.27e3  9.54e3  9.81e3  1.01e4  1.03e4
## 10 Antigu~  6.19e4  6.24e4  6.34e4  6.49e4  6.66e4  6.83e4  7.02e4  7.22e4
## # ... with 265 more rows, and 2 more variables: `1998` <dbl>, `1999` <dbl>

For the next data wrangling steps, it is more convenient to have the data in long format.

4

Transform the gap_pop dataset into a sensible long format. Name the variable representing the values for population pop and store the resulting dataframe in a name with the same object as before (gap_pop).
This is just a repetition from the Tidy Data exercises. What we want to do is to gather the columns with the years into a year variable.
library(tidyr)

gap_pop <- gap_pop %>% 
  gather(-country, key = "year", value = "pop")

For some analyses, it might help (or even be necessary) to only work with a specific subset of observations.

5

Create two new dataframes that include different subets of the gap_pop data:

  1. Data for all countries for the 19th century (name this one gap_pop_19thcen),

  2. Data for Germany for the years from 2000 onwards (name this one gap_gop_ger_21stcen).
There are several ways to filter the observations according to the above instructions. However, some require more typing than others.
gap_pop_19thcen <- gap_pop %>% 
  filter(year < 1900)

gap_pop_19thcen
## # A tibble: 2,750 x 3
##    country               year      pop
##    <chr>                 <chr>   <dbl>
##  1 Abkhazia              1800       NA
##  2 Afghanistan           1800  3280000
##  3 Akrotiri and Dhekelia 1800       NA
##  4 Albania               1800   410445
##  5 Algeria               1800  2503218
##  6 American Samoa        1800     8170
##  7 Andorra               1800     2654
##  8 Angola                1800  1567028
##  9 Anguilla              1800     2025
## 10 Antigua and Barbuda   1800    37000
## # ... with 2,740 more rows
gap_pop_ger_21stcen <- gap_pop %>% 
  filter(country == "Germany", year > 1999)

gap_pop_ger_21stcen
## # A tibble: 16 x 3
##    country year       pop
##    <chr>   <chr>    <dbl>
##  1 Germany 2000  81895925
##  2 Germany 2001  81809438
##  3 Germany 2002  81699829
##  4 Germany 2003  81569481
##  5 Germany 2004  81417791
##  6 Germany 2005  81246801
##  7 Germany 2006  81055904
##  8 Germany 2007  80854515
##  9 Germany 2008  80665906
## 10 Germany 2009  80519685
## 11 Germany 2010  80435307
## 12 Germany 2011  80424665
## 13 Germany 2012  80477952
## 14 Germany 2013  80565861
## 15 Germany 2014  80646262
## 16 Germany 2015  80688545

For some analyses as well as for plotting the data, it makes sense to define the country variable as a factor.

6

Change the variable types of the dataset: country should be a factor, year and pop should be integers. Again, keep the object name for the resulting dataframe.
You need to use the mutate() function for this.
gap_pop <- gap_pop %>% 
  mutate(country = as.factor(country),
         year = as.integer(year),
         pop = as.integer(pop))

gap_pop
## # A tibble: 22,275 x 3
##    country                year     pop
##    <fct>                 <int>   <int>
##  1 Abkhazia               1800      NA
##  2 Afghanistan            1800 3280000
##  3 Akrotiri and Dhekelia  1800      NA
##  4 Albania                1800  410445
##  5 Algeria                1800 2503218
##  6 American Samoa         1800    8170
##  7 Andorra                1800    2654
##  8 Angola                 1800 1567028
##  9 Anguilla               1800    2025
## 10 Antigua and Barbuda    1800   37000
## # ... with 22,265 more rows

Let’s imagine that we want to combine the population data that we have with some other country-level data (we will discuss joining datasets in session B2 on relational data tomorrow). If the data come from different sources, it is quite likely that the names of the countries differ between them. If we want to join the datasets, we need to harmonize the country names.

7

The gap_pop dataset contains data for the countries Cook Is, Kyrgyz Republic, and Micronesia, Fed. Sts.. Rename them to Cook Island, Kyrgyzstan, and Micronesia.
As we want to use the country variable for joining the datasets in this hypothetical example, we should recode into the same variable.
gap_pop %>% 
  mutate(country = recode(country,
                          "Cook Is" = "Cook Island",
                          "Kyrgyz Republic" = "Kyrgyzstan",
                          "Micronesia, Fed. Sts." = "Micronesia"))
## # A tibble: 22,275 x 3
##    country                year     pop
##    <fct>                 <int>   <int>
##  1 Abkhazia               1800      NA
##  2 Afghanistan            1800 3280000
##  3 Akrotiri and Dhekelia  1800      NA
##  4 Albania                1800  410445
##  5 Algeria                1800 2503218
##  6 American Samoa         1800    8170
##  7 Andorra                1800    2654
##  8 Angola                 1800 1567028
##  9 Anguilla               1800    2025
## 10 Antigua and Barbuda    1800   37000
## # ... with 22,265 more rows

Of course, instead of changing the variable types at this point, we could have also specified the column types when reading in the data (see the session and exercises on importing data).

In the next step, we want to create some new variables based on ones that already exist in the dataset.

8

  1. Create the variable Population in thousands in the gap_pop dataset (name the new variable pop_in_thousands).

  2. Compute the percentage change in population since the previous year for the gap_pop_ger_21stcen dataset (name this new variable pop_perc_change).
To compute the percentage change variable you need the lag() function.
gap_pop %>% 
  mutate(pop_in_thousands = pop/1000)
## # A tibble: 22,275 x 4
##    country                year     pop pop_in_thousands
##    <fct>                 <int>   <int>            <dbl>
##  1 Abkhazia               1800      NA            NA   
##  2 Afghanistan            1800 3280000          3280   
##  3 Akrotiri and Dhekelia  1800      NA            NA   
##  4 Albania                1800  410445           410.  
##  5 Algeria                1800 2503218          2503.  
##  6 American Samoa         1800    8170             8.17
##  7 Andorra                1800    2654             2.65
##  8 Angola                 1800 1567028          1567.  
##  9 Anguilla               1800    2025             2.02
## 10 Antigua and Barbuda    1800   37000            37   
## # ... with 22,265 more rows
gap_pop_ger_21stcen %>% 
  mutate(pop_perc_change = (pop - lag(pop))/pop*100)
## # A tibble: 16 x 4
##    country year       pop pop_perc_change
##    <chr>   <chr>    <dbl>           <dbl>
##  1 Germany 2000  81895925         NA     
##  2 Germany 2001  81809438         -0.106 
##  3 Germany 2002  81699829         -0.134 
##  4 Germany 2003  81569481         -0.160 
##  5 Germany 2004  81417791         -0.186 
##  6 Germany 2005  81246801         -0.210 
##  7 Germany 2006  81055904         -0.236 
##  8 Germany 2007  80854515         -0.249 
##  9 Germany 2008  80665906         -0.234 
## 10 Germany 2009  80519685         -0.182 
## 11 Germany 2010  80435307         -0.105 
## 12 Germany 2011  80424665         -0.0132
## 13 Germany 2012  80477952          0.0662
## 14 Germany 2013  80565861          0.109 
## 15 Germany 2014  80646262          0.0997
## 16 Germany 2015  80688545          0.0524

Finally, let’s combine two basic data wrangling steps to answer an actual question with the data.

9

Which 5 countries were the most and least populous ones in 2015?
To answer this question you need to filter() and arrange() the gap_pop dataset.
gap_pop %>% 
  filter(year == 2015) %>% 
  arrange(-pop) %>% 
  head(n = 5)
## # A tibble: 5 x 3
##   country        year        pop
##   <fct>         <int>      <int>
## 1 China          2015 1376048943
## 2 India          2015 1311050527
## 3 United States  2015  321773631
## 4 Indonesia      2015  257563815
## 5 Brazil         2015  207847528
gap_pop %>% 
  filter(year == 2015) %>% 
  arrange(pop) %>% 
  head(n = 5)
## # A tibble: 5 x 3
##   country                 year   pop
##   <fct>                  <int> <int>
## 1 Holy See                2015   800
## 2 Tokelau                 2015  1250
## 3 Niue                    2015  1610
## 4 Falkland Is (Malvinas)  2015  2903
## 5 St. Helena              2015  3961